A formal model for dataflows, runs of dataflows, and provenance within runs

Authors

  • Natalia Kwasnikowska
  • Jan Van den Bussche
Abstract

Modern scientific research is characterized by extensive computerized data processing of laboratory results and other scientific data. Such processes are often complex, consisting of several data-manipulating steps. We refer to such processes as dataflows, to distinguish them from more general workflows. General workflows also emphasize the control-flow aspect of a process, whereas our focus is mainly on data manipulation and data management. Important data management aspects of scientific dataflows include, among others:

Similar resources

A Formal Model of Dataflow Repositories

Dataflow repositories are databases containing dataflows and their different runs. We propose a formal conceptual data model for such repositories. Our model includes careful formalisations of such features as complex data manipulation, external service calls, subdataflows, and the provenance of output values.

Análise de Estratégias de Acesso a Grandes Volumes de Dados

The efficient processing of big data has become an issue for several areas. In science, researchers have used dataflows to express computational analysis and experiments on data. An important feature of scientific dataflows is that the analysis must scan a large set of data. In this sense, this work investigates alternatives for storing large volumes of data favoring the execution of dataflows ...

MULTI-AGENT INFORMATION PROCESSING AND ADAPTIVE CONTROL IN GLOBAL TELECOMMUNICATION AND COMPUTER NETWORKS A.V.Timofeev

The problems and methods for adaptive control and multi-agent processing of information in global telecommunication and computer networks (TCN) are discussed. Criteria for controllability and communication ability (routing ability) of dataflows are described. Multi-agent model for exchange of divided information resources in global TCN has been suggested. Peculiarities for adaptive and intellig...

Optimizing ETL Dataflow Using Shared Caching and Parallelization Methods

Extract-Transform-Load (ETL) handles large amounts of data and manages workload through dataflows. ETL dataflows are widely regarded as complex and expensive operations in terms of time and system resources. In order to minimize the time and the resources required by ETL dataflows, this paper presents a framework to optimize dataflows using shared cache and parallelization techniques. The framew...

SOFA: An Extensible Logical Optimizer for UDF-heavy Dataflows

Recent years have seen an increased interest in large-scale analytical dataflows on non-relational data. These dataflows are compiled into execution graphs scheduled on large compute clusters. In many novel application areas the predominant building blocks of such dataflows are user-defined predicates or functions (UDFs). However, the heavy use of UDFs is not well taken into account for dataflo...

Publication date: 2007